Abstract

The proximal problem for structured penalties obtained via convex relaxations of submodular functions
is known to be equivalent to minimizing separable convex functions over the corresponding submodular
polyhedra. In this paper, we reveal a comprehensive class of structured penalties for which
this problem can be solved via an efficiently solvable class of parametric maxflow optimization. We then
show that the parametric maxflow algorithm proposed by Gallo et al. (1989) and its variants, which run, in
the worst case, at the cost of only a constant factor of a single computation of the corresponding maxflow
optimization, can be adapted to solve the proximal problems for those penalties. Several existing structured
penalties satisfy these conditions; thus, regularized learning with these penalties can be solved quickly
using the parametric maxflow algorithm. We also investigate the empirical runtime performance of the
proposed framework.


1 Introduction 

Learning with structural information in data has been a primary interest in machine learning. Regularization 
with structured sparsity-inducing penalties, such as group Lasso [Ml ES] and (generalized) fused Lasso 
[Mll^, has been shown to achieve high predictive performance and solutions that are easier to interpret, 
and has been successfully applied to a broad range of applications, including bioinformatics ISlISIlEilEHlIlH],
computer vision [Ml [SSI |35| , natural language processing [HIM] and signal processing [171 [Ml- 

Recently, it has been revealed that many of the existing structured sparsity-inducing penalties can be
interpreted as convex relaxations of submodular functions [3, 41]. Based on this result, the calculation of
the proximal operators for such penalties is known to reduce to the minimization of separable convex
functions over the corresponding submodular polyhedra, which can be solved via the iteration of submodular
minimization. However, minimizing a submodular function is not effectively scalable (due to its generality);
thus, a natural next step is to clarify when the problem is solvable as a special case that can be
calculated faster, especially as an efficiently solvable class of network flow optimization.
Several specific problems are known to be solvable via such network flow optimization. For example, a class
of the total variation, which is equivalent to generalized fused Lasso (GFL), is known to be solved via
parametric maxflows [iniilS]. Mairal et al. (2011) [36] and Mairal & Yu (2013) [37] proposed parametric maxflow
algorithms for $\ell_1/\ell_\infty$-regularization and the path-coding, respectively. In addition, Takeuchi et al. (2015)
recently proposed a generalization of GFL to a hypergraph case, which they call higher-order fused Lasso,
with a parametric maxflow algorithm.

In this paper, we first develop sufficient conditions for determining whether a submodular function
corresponding to a given structured penalty is graph-representable, i.e., realizable as a projection of a graph-cut
function with auxiliary nodes. Several existing structured penalties from submodular functions, such as
(overlapping) grouped penalty and (generalized) fused penalty, satisfy these conditions. Then, we show that
the parametric maxflow algorithm proposed by Gallo et al. (1989) and its variants (hereafter, we call those the
GGT-type algorithms) are applicable to calculate the proximal problems for penalties obtained via convex
relaxation of such submodular functions, and run at the cost of only a constant factor in the worst-case
time bound of the corresponding maxflow optimization. Also, we empirically investigate the comparative
performance of the proposed framework against existing algorithms.

Thus, the main contribution of this work is two-fold: (i) we develop sufficient conditions (with concrete 
ways of constructing the corresponding networks) for the class of structured penalties that can be solved 
via a parametric maxflow algorithm and (ii) we show that an efficient parametric flow algorithm can be 
applied to the proximal problem for such penalties. Note that the first one is closely related to the class of 
energy minimization problems that can be solved with the so-called graph-cut algorithm, which has been 
discussed actively in computer vision [30, 29]. Similar discussions are found in the context of realization of a
submodular function as a cut function in combinatorial optimization [ilMlIIS]. Our current work relates
such discussions to structured regularized learning. As for the second one, our proposed
formulation gives a unified view of the class of structured regularization that can be solved as a parametric
maxflow problem, which generalizes, extends or connects several existing works that have been separately
discussed to date, such as uni nnnsg miETiig 154] . without increasing the essential theoretical run-time 
bound. 

The remainder of this paper is organized as follows. We first define notations and describe preliminaries in
Section 2. Then, in Section 3, we give a brief review of structured penalties obtained as convex relaxations of
submodular functions. In Section 4, we describe the sufficient conditions for determining whether the proximal
problem for a given penalty is solvable via network flow optimization. In Section 5, we develop the parametric
flow algorithm for proximal problems for penalties satisfying these conditions. In Section 6, we describe related
work. Finally, we show runtime comparisons for calculating the proximal problem for the penalties by the
proposed and existing algorithms in Section 7, and conclude the paper in Section 8. All proofs are given in
the Appendix.


2 Notations and Preliminaries 

In this section, we introduce notations used in this paper, and give brief reviews on submodular functions
in Section 2.1 and network flow optimization in Section 2.2.

2.1 Submodular Functions 

Let $d$ be a positive integer and $V := \{1, 2, \ldots, d\}$. We denote the complement of $A$ by $\bar{A}$ for $A \subseteq V$, i.e.,
$\bar{A} = V \setminus A$. For a real vector $w = (w_i)_{i \in V} \in \mathbb{R}^V$ and a subset $A \subseteq V$, define $w(A) := \sum_{i \in A} w_i$. A set
function $F : 2^V \to \mathbb{R}$ is called submodular if

$$F(A) + F(B) \ge F(A \cap B) + F(A \cup B)$$

for any $A, B \subseteq V$ [HI E].
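The inequality above can be checked by brute force on small ground sets. The following is a minimal sketch; the helper names `all_subsets` and `is_submodular` and the example groups are illustrative assumptions, not part of the paper. It tests a group-cover function of the form used later in the paper, and, as a counterexample, the squared cardinality.

```python
from itertools import combinations

def all_subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground, tol=1e-12):
    """Check F(A) + F(B) >= F(A & B) + F(A | B) for all pairs A, B."""
    subsets = list(all_subsets(ground))
    return all(
        F(A) + F(B) >= F(A & B) + F(A | B) - tol
        for A in subsets
        for B in subsets
    )

V = {1, 2, 3, 4}
GROUPS = [frozenset({1, 2}), frozenset({2, 3, 4})]  # hypothetical groups

def group_cover(A):
    """F(A) = sum over groups g of min{|A & g|, 1}."""
    return sum(min(len(A & g), 1) for g in GROUPS)

def card_squared(A):
    """|A|^2, which violates the inequality for disjoint singletons."""
    return len(A) ** 2

if __name__ == "__main__":
    print(is_submodular(group_cover, V))
    print(is_submodular(card_squared, V))
```

The check enumerates all pairs of subsets, so it is only practical for very small $d$; it is meant as a sanity test, not as an algorithm.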

We denote by $\hat{F}$ the Lovász extension of a set function $F$ with $F(\emptyset) = 0$, i.e., $\hat{F} : \mathbb{R}^V \to \mathbb{R}$ is a continuous
function defined as, for each $w \in \mathbb{R}^V$,

$$\hat{F}(w) := \sum_{i=1}^{d} w_{j_i} \bigl( F(\{j_1, \ldots, j_i\}) - F(\{j_1, \ldots, j_{i-1}\}) \bigr),$$

where $j_1, j_2, \ldots, j_d \in V$ are the distinct indices corresponding to a permutation that arranges the entries of
$w$ in nonincreasing order, i.e., $w_{j_1} \ge \cdots \ge w_{j_d}$ [55].
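The definition translates directly into a sort followed by one pass over the prefixes. The sketch below is illustrative only; the function name `lovasz_extension` and the example set function are assumptions. It also checks the standard property that the extension agrees with $F$ on indicator vectors.

```python
from itertools import combinations

def lovasz_extension(F, w):
    """Evaluate the Lovasz extension of F at w.

    F maps frozensets to numbers with F(frozenset()) == 0;
    w is a dict mapping each element of the ground set to a real value.
    """
    order = sorted(w, key=lambda i: -w[i])  # nonincreasing entries of w
    value, prefix, prev = 0.0, set(), F(frozenset())
    for j in order:
        prefix.add(j)
        cur = F(frozenset(prefix))
        value += w[j] * (cur - prev)  # marginal gain of adding j
        prev = cur
    return value

def truncated_card(A):
    """Example set function F(A) = min{|A|, 1}."""
    return min(len(A), 1)

def agrees_on_indicators(F, ground):
    """Check that the extension equals F(A) on every 0/1 indicator vector."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            indicator = {i: (1.0 if i in combo else 0.0) for i in items}
            if abs(lovasz_extension(F, indicator) - F(frozenset(combo))) > 1e-12:
                return False
    return True

if __name__ == "__main__":
    # For F(A) = min{|A|, 1} the extension is the largest entry of w.
    print(lovasz_extension(truncated_card, {1: 0.5, 2: 2.0, 3: -1.0}))
```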

For a submodular function $F$ with $F(\emptyset) = 0$, the submodular polyhedron $P(F) \subseteq \mathbb{R}^V$ and the base polyhedron
$B(F) \subseteq \mathbb{R}^V$ are respectively defined as

$$P(F) := \{ x \in \mathbb{R}^V \mid x(A) \le F(A) \ (\forall A \subseteq V) \} \quad \text{and} \quad B(F) := \{ x \in P(F) \mid x(V) = F(V) \}.$$

We define $P_+(F) := \mathbb{R}^V_+ \cap P(F)$.

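For an integer $i$ with $0 \le i \le d$, let $\binom{V}{i}$ denote the set of $i$-element subsets of $V$. For any set function $F$,
there uniquely exist functions $F^{(i)} : \binom{V}{i} \to \mathbb{R}$ ($i = 0, 1, \ldots, d$) such that

$$F(A) = \sum_{i=0}^{|A|} \sum_{Y \in \binom{A}{i}} F^{(i)}(Y) \quad (A \subseteq V),$$

where, for each $i = 0, 1, \ldots, d$,

$$F^{(i)}(A) = \sum_{Y \subseteq A} (-1)^{|A \setminus Y|} F(Y) \quad \Bigl(A \in \tbinom{V}{i}\Bigr)$$

by the Möbius inversion formula (see, for example, HD-). A set function $F$ is said to be of order $k$ for an
integer $k$ with $0 \le k \le d$ if $F^{(k)} \ne 0$ and $F^{(i)} = 0$ ($k + 1 \le i \le d$).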
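As a small illustration of the Möbius coefficients and of the order of a set function, the sketch below computes both by enumeration. The names `mobius`, `order_of` and the path-graph example are assumptions made for this illustration. The cut function of a graph has nonzero coefficients only on singletons and edges, so it is of order two.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def mobius(F, ground):
    """Return {A: sum over Y subset of A of (-1)^{|A \\ Y|} F(Y)}."""
    return {
        A: sum((-1) ** (len(A) - len(Y)) * F(Y) for Y in subsets(A))
        for A in subsets(ground)
    }

def order_of(F, ground, tol=1e-12):
    """Largest cardinality carrying a nonzero Mobius coefficient."""
    sizes = [len(A) for A, c in mobius(F, ground).items() if abs(c) > tol]
    return max(sizes, default=0)

V = {1, 2, 3}
EDGES = [(1, 2), (2, 3)]  # hypothetical unit-weight path 1 - 2 - 3

def path_cut(A):
    """Number of edges with exactly one endpoint in A."""
    return sum(1 for (i, j) in EDGES if (i in A) != (j in A))

def reconstructs(F, ground):
    """Check F(A) == sum of coefficients over all subsets of A."""
    coeffs = mobius(F, ground)
    return all(
        F(A) == sum(coeffs[Y] for Y in subsets(A)) for A in subsets(ground)
    )

if __name__ == "__main__":
    print(order_of(path_cut, V))
```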


2.2 Flow Terminology 

Suppose we are given a directed network $\mathcal{N} = (U, E)$ with a finite vertex set $U$ and an edge set $E \subseteq U \times U$,
a distinguished source vertex $s \in U$, a distinguished sink vertex $t \in U$, and a nonnegative capacity $c(u, v)$ for
each edge $(u, v) \in E$. Define $c(u, v) := 0$ for each pair $(u, v) \in (U \times U) \setminus E$. A flow $f$ on $\mathcal{N}$ is a real-valued
function on vertex pairs satisfying the following three constraints:

$$f(u, v) \le c(u, v) \quad \text{for } (u, v) \in U \times U \quad \text{(capacity)},$$

$$f(u, v) = -f(v, u) \quad \text{for } (u, v) \in U \times U \quad \text{(antisymmetry), and}$$

$$\sum_{u \in U} f(u, v) = 0 \quad \text{for } v \in U \setminus \{s, t\} \quad \text{(conservation)}.$$


The value of a flow $f$ is $\sum_{v \in U} f(v, t)$. A maximum flow is a flow of maximum value. For disjoint $A, B \subseteq U$,
the capacity of the pair $(A, B)$ is defined as $c(A, B) := \sum_{u \in A, v \in B} c(u, v)$. A cut $(C, \bar{C})$ is a two-part vertex
partition (i.e., $C \cup \bar{C} = U$, $C \cap \bar{C} = \emptyset$) such that $s \in C$ and $t \in \bar{C}$. A minimum cut is a cut of minimum capacity. The
capacity constraint implies that for any flow $f$ and any cut $(C, \bar{C})$, we have $f(C, \bar{C}) \le c(C, \bar{C})$, which implies
that the value of a maximum flow is at most the capacity of a minimum cut. The max-flow min-cut theorem
of [13] states that these two quantities are equal.
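The equality of the two quantities can be observed on a toy network. The sketch below is not the preflow algorithm used later in the paper; it substitutes a plain augmenting-path method (Edmonds-Karp) for brevity and compares it with a brute-force minimum cut. The network and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def max_flow(cap, s, t):
    """Edmonds-Karp on a dict {(u, v): capacity}; returns the flow value."""
    residual = dict(cap)
    for (u, v) in cap:
        residual.setdefault((v, u), 0.0)  # reverse residual arcs
    nodes = {x for edge in cap for x in edge}
    total = 0.0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # breadth-first search
            u = queue.popleft()
            for v in nodes:
                if v not in parent and residual.get((u, v), 0.0) > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total  # no augmenting path is left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)  # bottleneck capacity
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        total += push

def min_cut_bruteforce(cap, s, t):
    """Minimum capacity over all vertex sets C with s in C and t not in C."""
    nodes = {x for edge in cap for x in edge}
    inner = sorted(nodes - {s, t})
    best = float("inf")
    for r in range(len(inner) + 1):
        for combo in combinations(inner, r):
            C = {s, *combo}
            value = sum(c for (u, v), c in cap.items() if u in C and v not in C)
            best = min(best, value)
    return best

# Hypothetical capacities on a four-node network.
CAP = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}

if __name__ == "__main__":
    print(max_flow(CAP, "s", "t"), min_cut_bruteforce(CAP, "s", "t"))
```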


3 Penalties via Convex Relaxation of Submodular Functions 


We briefly review structured penalties through convex relaxations of submodular functions, which cover
several known structured sparsity-inducing penalties, in Subsection 3.1, and then the existing optimization
methods for their proximal problems in Subsection 3.2.


3.1 Structured Penalties from Submodular Functions 

Structured penalties obtained via convex relaxations of submodular functions can be categorized into two
types. Here, we review these respectively in Sections 3.1.1 and 3.1.2.


3.1.1 Penalty via $\ell_p$-relaxation of Nondecreasing Submodular Function

The first type of the penalty from a submodular function is defined through convex relaxation with the
$\ell_p$-norm mniis]. For this type, a submodular function $F$ is required to be non-decreasing. To define this
penalty, we first consider a function $h : \mathbb{R}^V \to \mathbb{R}$ that penalizes both supports and the $\ell_p$-norm on the supports:

$$h(w) = \frac{1}{p} \|w\|_p^p + \frac{1}{r} F(\mathrm{supp}(w)), \tag{1}$$

where $1/p + 1/r = 1$. Note that when $p$ tends to infinity, function $h$ tends to $F(\mathrm{supp}(w))$ restricted to the
$\ell_\infty$-ball. The following is known for any $p \in (1, +\infty)$.

Proposition 1 [41]. Let $F$ be a non-decreasing function s.t. $F(\{i\}) > 0$ for all $i \in V$. The tightest convex
homogeneous lower-bound of $h(w)$ is a norm, denoted by $\Omega_{F,p}$, such that its dual norm equals, for $s \in \mathbb{R}^V$,

$$\Omega^*_{F,p}(s) = \sup_{A \subseteq V,\, A \ne \emptyset} \frac{\|s_A\|_r}{F(A)^{1/r}}. \tag{2}$$


Note that, if $F$ is submodular, then only stable inseparable sets may be kept in the definition of $\Omega^*_{F,p}$ in
Eq. (2). From the above definition, we obtain, for any $w \in \mathbb{R}^V$,

$$\Omega_{F,p}(w) = \sup_{s \in \mathbb{R}^V} s^\top w \ \text{ such that } \ \Omega^*_{F,p}(s) \le 1$$

$$= \sup_{s \in \mathbb{R}^V} s^\top w \ \text{ such that } \ \forall A \subseteq V,\ \|s_A\|_r^r \le F(A)$$

$$= \sup_{t \in P_+(F)} \sum_{i \in V} t_i^{1/r} |w_i|, \tag{3}$$

where we change the variables as $t_i = s_i^r$. The first equality is obtained using the Fenchel duality.
Consequently, the norm $\Omega_{F,p}$ is computed with a separable form over (the positive part of) the corresponding
submodular polyhedron.

It is easy to check that, if we use $F(A) = |A|$, then $\Omega_{F,p}$ is equivalent to the $\ell_p$-regularization. And, if we use
$F(A) = \sum_{g \in \mathcal{G}} \min\{|A \cap g|, 1\}$ for a group of variables $\mathcal{G}$, then $\Omega_{F,p}$ is equivalent to the (possibly overlapping)
$\ell_1/\ell_\infty$ and non-overlapping $\ell_1/\ell_p$ group regularizations, or provides group sparsity similar to the overlapping
$\ell_1/\ell_p$ group regularization [41, 5].


3.1.2 Penalty by the Lovász Extension of Submodular Function

The other type of penalty is defined as the Lovász extension, i.e., $\ell_\infty$-relaxation, of a submodular function $F$
with $F(\emptyset) = F(V) = 0$. This is known to make some of the components of $w$ equal when used as a
regularizer [3]. A representative example of this type of penalty is the generalized fused Lasso (GFL), which
is defined for a given undirected network $\mathcal{N} = (V, E)$ as

$$\Omega_{\mathrm{fl}}(w) = \sum_{(i,j) \in E} a_{ij} |w_i - w_j|,$$

where $a_{ij}$ is the weight on each pair $(i, j)$. This penalty is known to be equivalent to the Lovász extension of
a cut function on $\mathcal{N}$, i.e., $F(A) = \sum_{i \in A,\, j \in V \setminus A} a_{ij}$. This can be extended to a hypergraph $\mathcal{H} = (V, E)$
with non-negative weight $a_e$ for each hyperedge $e \in E$, where the Lovász extension of a hypergraph cut
function $F(A) = \sum_{e \in E :\, e \cap A \ne \emptyset,\, e \cap \bar{A} \ne \emptyset} a_e$ gives the hypergraph regularization
$\Omega_{\mathrm{hr}}(w) = \sum_{e \in E} a_e (\max_{i \in e} w_i - \min_{i \in e} w_i)$ [12].

From the definition, the Lovász extension $\hat{F}$ of a submodular function $F$ with $F(\emptyset) = 0$ can be represented as a
greedy solution over the submodular polyhedron [33], i.e.,

$$\hat{F}(w) = \sup_{t \in P_+(F)} \sum_{i \in V} t_i |w_i|,$$

which is in fact the equivalent form with Eq. (3) for $r = 1$ (i.e., $p = \infty$).
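The stated equivalence between the generalized fused Lasso and the Lovász extension of a cut function can be checked numerically. The sketch below uses the sorting-based definition of the extension from Section 2.1 on a small weighted triangle; the weights, the names `cut`, `gfl`, `lovasz_extension` and the random test points are all assumptions for illustration.

```python
import random

# Hypothetical undirected network on nodes 0, 1, 2 with weights a_ij.
EDGES = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 0.5}

def cut(A):
    """Cut function: total weight of edges with exactly one endpoint in A."""
    return sum(a for (i, j), a in EDGES.items() if (i in A) != (j in A))

def gfl(w):
    """Generalized fused Lasso penalty: sum of a_ij * |w_i - w_j|."""
    return sum(a * abs(w[i] - w[j]) for (i, j), a in EDGES.items())

def lovasz_extension(F, w):
    """Sorting-based Lovasz extension of F (with F(empty) = 0) at list w."""
    order = sorted(range(len(w)), key=lambda i: -w[i])
    value, prefix, prev = 0.0, set(), 0.0
    for j in order:
        prefix.add(j)
        cur = F(prefix)
        value += w[j] * (cur - prev)
        prev = cur
    return value

def max_gap(trials=200, seed=0):
    """Largest observed difference between the two penalties."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        w = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        gap = max(gap, abs(gfl(w) - lovasz_extension(cut, w)))
    return gap

if __name__ == "__main__":
    print(gfl([3.0, 1.0, 2.0]), lovasz_extension(cut, [3.0, 1.0, 2.0]))
```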

3.2 Proximal Problem for Submodular Penalties 

The above penalties have a common form, for a (normalized) submodular function $F$,

$$\Omega_{F,p}(w) := \sup_{t \in P_+(F)} \sum_{i \in V} t_i^{1/r} |w_i|, \tag{4}$$

where $p \in (1, +\infty)$ and $1/p + 1/r = 1$. However, note that, if $F$ is not nondecreasing, then $\Omega_{F,p}(w)$ does
not necessarily have the duality described in Section 3.1. When using the norm $\Omega_{F,p}$ as a regularizer, we
solve the following problem for some (convex and smooth) loss $l : \mathbb{R}^V \to \mathbb{R}$ that corresponds to the respective
learning task:

$$\min_{w \in \mathbb{R}^V} \ l(w) + \lambda \cdot \Omega_{F,p}(w) \quad (\lambda > 0).$$

Since the objective of this problem is the sum of smooth and non-smooth convex functions, a major option
for its optimization is the proximal gradient method, such as FISTA (Fast Iterative Shrinkage-Thresholding
Algorithm) [7]. Thus, the necessary step is to iteratively compute the proximal operator

$$\mathrm{prox}_{\lambda \Omega_{F,p}}(z) := \operatorname*{argmin}_{w \in \mathbb{R}^V} \ \frac{1}{2} \|z - w\|_2^2 + \lambda \cdot \Omega_{F,p}(w), \tag{5}$$

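where $z \in \mathbb{R}^V$. From the definition of $\Omega_{F,p}$, we can calculate $\mathrm{prox}_{\lambda \Omega_{F,p}}(z)$ by solving

$$\min_{w \in \mathbb{R}^V} \max_{t \in P_+(F)} \ \frac{1}{2} \|w - z\|_2^2 + \lambda \sum_{i \in V} t_i^{1/r} |w_i|
= \max_{t \in P_+(F)} \sum_{i \in V} \min_{w_i \in \mathbb{R}} \Bigl\{ \frac{1}{2} (w_i - z_i)^2 + \lambda t_i^{1/r} |w_i| \Bigr\}
= - \min_{t \in P_+(F)} \sum_{i \in V} \psi_i(t_i), \tag{6}$$

where $\psi_i(t_i) = -\min_{w_i \in \mathbb{R}} \{ \frac{1}{2} (w_i - z_i)^2 + \lambda t_i^{1/r} |w_i| \}$. Thus, solving the proximal problem equals minimizing
a separable convex function over the submodular polyhedron.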

Based on the above formulation, Obozinski & Bach (2012) [41] recently suggested a divide-and-conquer
algorithm as an adaptation of the decomposition algorithm by Groenevelt (1991) [21] for penalties from
general submodular functions (for the case of $p = 2$). A more general version of this approach was also
developed by Bach (2013) [5]. However, a straightforward implementation of this approach requires $O(d)$
calls to submodular minimization, which could be time-consuming, especially in large problems.

We address this issue from the following two perspectives. First, in Section 4, we develop
explicit sufficient conditions for determining whether the proximal problem for a given penalty can be solved
through maximum flow optimization rather than submodular minimization. Maximum flow optimization can
be regarded as an efficiently-solvable special case of submodular minimization, and is known to be much faster
than submodular minimization in general; thus, this could be useful to judge whether a given penalty can
be dealt with in a scalable manner as a regularizer. The respective structured penalties from submodular
functions mentioned above are in fact instances of this case. On that basis, in Section 5, we develop a
procedure for problem (6) that runs at the cost of only a constant factor in the worst-case time bound of the
maxflow calculation, rather than the $O(d)$ such calculations of the straightforward implementation. In other
words, we discuss whether an efficient parametric maxflow algorithm is applicable to the current problem.


4 Graph-Representable Penalties 

In this section, we develop sufficient conditions for determining whether the proximal problem for a given 
structured penalty is solvable through an efficiently-solvable class of network flow optimization. We also 
describe a concrete procedure to construct the corresponding network. 


4.1 Graph-Representable Set Functions 

The currently-known best complexity of minimizing a general submodular function is $O(d^6 + d^5\,\mathrm{EO})$, where
EO is the cost of evaluating a function value [22]. Although there exist practically faster algorithms, such
as the minimum-norm-point algorithm [15], as well as faster algorithms for special cases (e.g., Queyranne's
algorithm for symmetric submodular functions [44]), their scalability would not be practically sufficient,
especially if we must solve submodular minimization several times, which is the current case. In addition,
it is well known that a cut function (which is almost equivalent to a second-order submodular function [16])
can be minimized much faster through calculation of maxflows over the corresponding network. Given a
directed network $\mathcal{N} = (V, E)$ with nonnegative capacity $c(e)$ on each edge $e \in E$, a cut function $\kappa_{\mathcal{N}} : 2^V \to \mathbb{R}$
is defined as

$$\kappa_{\mathcal{N}}(A) := \sum \{ c(e) \mid e \in \delta^{\mathrm{out}}_{\mathcal{N}}(A) \} \quad (A \subseteq V),$$

where $\delta^{\mathrm{out}}_{\mathcal{N}}(A)$ denotes the set of edges leaving $A$ in $\mathcal{N}$. If $\mathcal{N}$ consists of $d$ nodes and $m$ edges, the currently
best runtime bound for the minimization is $O(md)$ [43]. Beyond this worst-case bound, the empirical
runtime is often much better with practical fast algorithms, e.g., [Min].

Figure 1: Construction of network $\mathcal{N}$ for a graph-representable function in Theorem 3 (Left (a): Condition
(i); Right (b)-(e): Conditions (ii) and (iii)).

However, the expressive power of a cut function is limited. Therefore, in order to balance
expressiveness and computational simplicity, using a higher-order function that is represented as a cut function
with auxiliary nodes is often helpful. Such a function is sometimes referred to as graph-representable [26]
and is defined as follows. Let $U = V \cup W \cup \{s, t\}$ for some finite set $W$ with $W \cap V = \emptyset$ and distinct elements
$s, t \notin V \cup W$, and let $\mathcal{N} = (U, E)$ be a directed network with nonnegative capacity $c(e)$ on each edge $e \in E$.
Then, define a set function $F : 2^V \to \mathbb{R}$ as

$$F(A) := \min_{Y \subseteq W} \kappa_{\mathcal{N}}(\{s\} \cup A \cup Y) + C_F \quad (A \subseteq V),$$

where $C_F \in \mathbb{R}$ is an arbitrary constant, and such $F$ is said to be graph-representable. If $W$ is empty, this
function coincides with a cut function. The submodularity of this function is derived from the classical result
of Megiddo (1974) [38] on network flow problems with multiple terminals (see, for the proof, [40]).

4.2 Sufficient Conditions and Network Construction 

As described in Section 5, if the corresponding set function $F$ for norm $\Omega_{F,p}$ is graph-representable, then its
proximal problem (6) can be efficiently solved through a parametric maxflow computation. Hereafter, we
refer to such a penalty as a graph-representable penalty, which is defined as follows.

^This class of functions is closely related to the class of energy minimization problems that can be solved by the so-called 
graph-cut algorithm 0130]. Related results are also found in the context of realization of a submodular function as a cut 
function in combinatorial optimization 0[l6]. 
Definition 2 (Graph-representable penalty). A penalty defined in Proposition 1 is said to be graph-representable
if the set function $F$ on supports is graph-representable.


Here, we present three types of sufficient conditions for a penalty $\Omega_{F,p}$ from a given submodular function $F$
(as described in Section 3.1) to be graph-representable by constructing networks representing F. The first 
one is mentioned as “truncations,” where a function F is graph-representable by just one additional node 
and also refer [55] or [10]). The second one is closely related to 0, and the third one is 
for which we describe concrete procedures to construct networks (see Figure for the 


(see Figure 1(a) 


derived from 
construction). 

Theorem 3. A set function $F$ with one of the following conditions is graph-representable.

(i) $F(A) = \min\{w(A), y\}$ for some $w \in \mathbb{R}^V_+$ and $y \in \mathbb{R}_+$.

(ii) $F$ is submodular and of order at most three, i.e., $F^{(i)} = 0$ for $i = 4, 5, \ldots, d$.

(iii) $F$ has no positive term of order at least two, i.e., $F^{(i)} \le 0$ for $i = 2, 3, \ldots, d$.


Remark. It should be noted that the sum of graph-representable submodular functions is also
graph-representable by considering the union of the corresponding networks.
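Condition (i) can be illustrated with a network that has a single auxiliary node. The construction in the sketch below is one possible choice and is an assumption of this illustration, not necessarily the one drawn in Figure 1(a): each $i \in V$ gets an arc to the auxiliary node `u` with capacity $w_i$, and `u` gets an arc to the sink with capacity $y$. Minimizing the cut value over whether `u` is on the source side then gives $\min\{w(A), y\}$ with $C_F = 0$. The weights and names are hypothetical.

```python
from itertools import combinations

W = {1: 0.5, 2: 1.5, 3: 2.0}  # hypothetical nonnegative weights w_i
Y_CAP = 2.0                   # hypothetical truncation level y
AUX = "u"                     # the single auxiliary node

# Arcs i -> u with capacity w_i, and u -> t with capacity y.
CAPACITY = {(i, AUX): wi for i, wi in W.items()}
CAPACITY[(AUX, "t")] = Y_CAP

def cut_value(S):
    """Total capacity of the arcs leaving the vertex set S."""
    return sum(c for (a, b), c in CAPACITY.items() if a in S and b not in S)

def represented(A):
    """Minimum over Y in {{}, {u}} of the cut value of {s} | A | Y."""
    base = {"s"} | set(A)
    return min(cut_value(base), cut_value(base | {AUX}))

def truncation(A):
    """The target function F(A) = min{w(A), y}."""
    return min(sum(W[i] for i in A), Y_CAP)

def max_gap():
    """Largest difference between the two functions over all subsets A."""
    items = sorted(W)
    gap = 0.0
    for r in range(len(items) + 1):
        for A in combinations(items, r):
            gap = max(gap, abs(represented(A) - truncation(A)))
    return gap

if __name__ == "__main__":
    print(max_gap())
```

If `u` is on the source side, only the arc to the sink is cut (value $y$); otherwise exactly the arcs from $A$ to `u` are cut (value $w(A)$).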


4.3 Examples 

A submodular function $F(A) = \sum_{g \in \mathcal{G}} \min\{|A \cap g|, 1\}$, which gives the grouped-type regularization, is
graph-representable since each term $\min\{|A \cap g|, 1\} = \min\{e_g(A), 1\}$, where $e_g$ is the indicator vector of $g$,
is guaranteed to be so from Condition (i). The cost for constructing the corresponding network for this
function is $O(|\mathcal{G}|)$ and the number of the additional nodes is $|\mathcal{G}|$.

Condition (ii) is a generalization of the condition under which a cut function $F(A) = \sum_{i \in A,\, j \in V \setminus A} a_{ij}$ on a network
$(V, E)$ can be minimized with maximum flows, i.e., positive weights $a_{ij}$ for all $(i, j) \in E$.

Besides, a hypergraph cut function $F(A) = \sum_{e \in E :\, e \cap A \ne \emptyset,\, e \cap \bar{A} \ne \emptyset} a_e$ is confirmed as graph-representable
as follows. For each hyperedge $e \in E$, define $F_{e,1}(A) := a_e \cdot \min\{|A \cap e|, 1\}$, $F_{e,2}(A) := -a_e$ if $e \subseteq A$, and
$F_{e,2}(A) := 0$ otherwise. Then, $F_{e,2}^{(|e|)}(e) = -a_e$ and $F_{e,2}^{(|A|)}(A) = 0$ for $A \ne e$. Hence, $F_{e,1}$ and $F_{e,2}$ satisfy
Conditions (i) and (iii), respectively, and it is easy to see $F = \sum_{e \in E} (F_{e,1} + F_{e,2})$. The network construction
requires $O(\|E\|)$ time, where $\|E\| = \sum_{e \in E} |e|$.
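The decomposition of the hypergraph cut function into a truncation term and a negative higher-order term can be verified by enumeration. The following sketch uses a hypothetical hypergraph on four nodes; the names `hyper_cut`, `f_e1` and `f_e2` are illustrative.

```python
from itertools import combinations

V = [1, 2, 3, 4]
# Hypothetical hyperedges with nonnegative weights a_e.
HYPEREDGES = {frozenset({1, 2, 3}): 1.0, frozenset({3, 4}): 2.5}

def hyper_cut(A):
    """Sum of a_e over hyperedges that A splits (meets but does not contain)."""
    return sum(a for e, a in HYPEREDGES.items() if (e & A) and (e - A))

def f_e1(e, a, A):
    """Truncation term a_e * min{|A & e|, 1} (Condition (i))."""
    return a * min(len(A & e), 1)

def f_e2(e, a, A):
    """Term equal to -a_e when e is contained in A (Condition (iii))."""
    return -a if e <= A else 0.0

def decomposed(A):
    """Sum over hyperedges of F_{e,1}(A) + F_{e,2}(A)."""
    return sum(f_e1(e, a, A) + f_e2(e, a, A) for e, a in HYPEREDGES.items())

def max_gap():
    """Largest difference between the cut and its decomposition."""
    gap = 0.0
    for r in range(len(V) + 1):
        for combo in combinations(V, r):
            A = frozenset(combo)
            gap = max(gap, abs(hyper_cut(A) - decomposed(A)))
    return gap

if __name__ == "__main__":
    print(max_gap())
```

For each hyperedge the sum is $0$ when $A$ misses $e$, $a_e - a_e = 0$ when $A$ contains $e$, and $a_e$ otherwise, which is exactly the contribution of $e$ to the cut.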


5 Parametric Maxflows for $\mathrm{prox}_{\lambda \Omega_{F,p}}(z)$


We describe how the proximal problem (6) for a graph-representable penalty is solvable with an adaptation
of the GGT-type algorithms. We first derive a parametric formulation of this problem in Subsection 5.1
and then develop the algorithm in Subsection 5.2.


5.1 Parametric Formulation 


We address a parametric formulation of problem (6). To this end, we first consider

$$\min_{\tau \in B_+(F)} \sum_{i \in V} \psi_i(\tau_i). \tag{7}$$

Note that the above optimization is over $B_+(F)$ in place of $P_+(F)$. In the following parts of this section, we
suppose that $F$ is non-decreasing (thus, $B_+(F)$ coincides with $B(F)$). Although this does not necessarily
hold for our case, we can show the following:

Lemma 4. Let $b \in \mathbb{R}^V_{>0}$ and $F$ be submodular, and set $\beta := \sup_{i \in V} \max\{0, F(V \setminus \{i\}) - F(V)\} / b_i$. Then, $F + \beta b$
is a nondecreasing submodular function. Also, $\tau^*$ is optimal to problem (7) for $F$ if and only if $\tau^* + \beta b$ is
optimal to problem (7) for $F + \beta b$.

Thus, for $F$ that is not non-decreasing, we can apply the algorithm developed below and recover an optimal
solution to the original problem by transforming it to a non-decreasing one as in this lemma.

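First, we define an interval $J \subseteq \mathbb{R}$ as

$$J := \bigcap_{i \in V} \{ \psi_i'(\tau_i) \mid \tau_i \in (\mathrm{dom}\,\psi_i \cap \mathbb{R}_+) \} \quad (= (-\infty, 0]).$$

Let $\tau^*$ be an optimal solution to problem (7). Denote the distinct values of $\psi_i'(\tau_i^*)$ by $\xi_1^* < \cdots < \xi_k^*$, and let
$\xi_0^* := -\infty$ and $\xi_{k+1}^* := +\infty$. Let $A_j^* := \{ i \in V \mid \psi_i'(\tau_i^*) \le \xi_j^* \}$ for $j = 0, 1, \ldots, k + 1$. Also, let

$$F_\alpha(A) := F(A) - \sum_{i \in A} \phi_i(\alpha) \quad (\alpha \in J),$$

where $\phi_i(\alpha) = \psi_i'^{-1}(\alpha)$ ($\alpha \in J \setminus \{0\}$) or $(|z_i| / \lambda)^r$ ($\alpha = 0$), and $^{-1}$ means an inverse function.

Lemma 5. Let $\alpha \in J$. If $\xi_j^* < \alpha < \xi_{j+1}^*$, $A_j^*$ is a minimizer of $F_\alpha$. If $\alpha = \xi_j^*$, $A_{j-1}^*$ is a minimal minimizer
and $A_j^*$ is a maximal minimizer of $F_\alpha$.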


This is obtained from Lemma 4 of [39] by replacing the assumption on the strict convexity of $\psi_i$ with the
monotonicity of the function in the region under consideration. As discussed in [39], this lemma implies that
problem (7) can be reduced to the following parametric problem:

$$\min_{A \subseteq V} F_\alpha(A) \quad \text{for all } \alpha \in J. \tag{8}$$

ACV 


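That is, once we have the chain of solutions $A_0^* \subset \cdots \subset A_{k+1}^*$ to problem (8) for all $\alpha \in J$, we can obtain an
optimal solution to problem (7) as, for $j = 0, \ldots, k$,

$$\tau_i^* = \phi_i(\alpha) \ \ (i \in A_{j+1}^* \setminus A_j^*) \quad \text{with } \alpha \text{ s.t. } \ F(A_{j+1}^*) - F(A_j^*) = \sum_{i \in A_{j+1}^* \setminus A_j^*} \phi_i(\alpha). \tag{9}$$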


The key here is that, if function $F$ is graph-representable, problem (8) can be solved as a parametric
minimum-cut (equivalently, a parametric maxflow) problem on $\mathcal{N}$, where $c(s, v)$ for $v \in V$ are functions of
$\alpha$ (since $\phi_i(\alpha) \ge 0$ for $\alpha \in J$), as will be stated in the next subsection.

Once we have a solution $\tau^*$ to problem (7), we can then obtain a solution to problem (6) as follows.

Corollary 6. If $\tau^*$ is an optimal solution to problem (7), then the one to problem (6) is given by

$$w_i^* = \begin{cases} z_i - \mathrm{sign}(z_i)\, \lambda\, (\max(\tau_i^*, 0))^{1/r} & \text{if } 0 \le \tau_i^* \le (|z_i| / \lambda)^r, \\ 0 & \text{otherwise.} \end{cases}$$


5.2 Algorithm Description 

As mentioned above, if the penalty is graph-representable, then problem (8) is solved as a parametric
maxflow problem on network $\mathcal{N}$, where capacities $c_\alpha(s, v)$ for $v \in V$ are $c_\alpha(s, v) = \phi_v(\alpha)$ (+ const.) and the
others are constant in $\alpha$ (note that $\phi_i(\alpha) \ge 0$ for $\alpha \in J$). Since $\psi_i$ is convex, those capacities satisfy the
conditions of the monotone source-sink class of problems, i.e.,

1. $c(s, v)$ is a non-decreasing function of $\alpha$ for all $v \in U$,

2. $c(v, t)$ is a non-increasing function of $\alpha$ for all $v \in U$, and

3. $c(u, v)$ is constant for all $u, v \in U \setminus \{s, t\}$.

Therefore, for a given on-line sequence of parameter values $\alpha_1 < \cdots < \alpha_k$, there exists a parametric maxflow
algorithm that computes minimum cuts $(A_1, \bar{A}_1), \ldots, (A_k, \bar{A}_k)$ on the network such that $A_1 \subseteq \cdots \subseteq A_k$,
and runs at the cost of only a constant factor in the worst-case time bound of a single maxflow computation.
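The nesting of the minimizers as the parameter grows can be observed directly on a toy instance. The sketch below does not run a parametric maxflow algorithm; it minimizes $F_\alpha(A) = F(A) - \sum_{i \in A} \phi_i(\alpha)$ by brute force, with a cut function as $F$ and, as a simplifying assumption, linear functions $\phi_i(\alpha) = b_i + \alpha$ in place of the functions derived from $\psi_i$. The graph, the offsets `B` and all names are hypothetical. Integer data are used so that ties are detected exactly.

```python
from itertools import combinations

V = [0, 1, 2, 3]
EDGES = {(0, 1): 2, (1, 2): 1, (2, 3): 3, (0, 3): 1}  # hypothetical cycle
B = {0: 1, 1: 3, 2: 2, 3: 4}                          # hypothetical offsets b_i

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def F(A):
    """Undirected cut function (submodular)."""
    return sum(a for (i, j), a in EDGES.items() if (i in A) != (j in A))

def F_alpha(A, alpha):
    """F(A) minus a modular term that is nondecreasing in alpha."""
    return F(A) - sum(B[i] + alpha for i in A)

def minimal_minimizer(alpha):
    """Intersection of all minimizers, itself a minimizer by submodularity."""
    values = {A: F_alpha(A, alpha) for A in subsets(V)}
    best = min(values.values())
    minimizers = [A for A, v in values.items() if v == best]
    return frozenset.intersection(*minimizers)

def chain(alphas):
    """Minimal minimizers for an increasing sequence of parameters."""
    return [minimal_minimizer(a) for a in alphas]

def is_nested(sets):
    """True if each set is contained in the next one."""
    return all(sets[k] <= sets[k + 1] for k in range(len(sets) - 1))

if __name__ == "__main__":
    for alpha, A in zip(range(-4, 5), chain(range(-4, 5))):
        print(alpha, sorted(A))
```

The parametric maxflow algorithm exploits exactly this monotone structure to avoid recomputing each cut from scratch.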


Algorithm 1 Parametric preflow algorithm for the computation of $\mathrm{prox}_{\lambda \Omega_{F,p}}(z)$.

Input: $\mathcal{N} = (U, E)$. Output: $w^* = \mathrm{prox}_{\lambda \Omega_{F,p}}(z)$.

1: Compute $\alpha_0$ as in Eq. (10) and set $\alpha_{k+1} \leftarrow 0$. Compute maximum flows $f_0$ and $f_{k+1}$, and minimum
cuts $(C_0, \bar{C}_0)$ and $(C_{k+1}, \bar{C}_{k+1})$ for $\alpha_0$ and $\alpha_{k+1}$ such that $|C_0|$ and $|C_{k+1}|$ are maximum and minimum
by applying the preflow algorithm to $\mathcal{N}$, respectively. Form $\mathcal{N}'$ from $\mathcal{N}$ by shrinking the nodes in $C_0$
and in $\bar{C}_{k+1}$ to single nodes respectively, eliminating loops, and combining multiple arcs by adding their
capacities.

2: If $\mathcal{N}'$ has at least three vertices, let $f_0'$, $f_{k+1}'$ be respectively the flows in $\mathcal{N}'$ corresponding to $f_0$, $f_{k+1}$.
Then, perform Slice($\mathcal{N}'$, $\alpha_0$, $\alpha_{k+1}$, $f_0'$, $f_{k+1}'$, $C_0$, $C_{k+1}$).

3: Compute $w^*$ as in Corollary 6 and return $w^*$.

Procedure Slice{Af, a;, a„, A;, 

1; Find d such that ca({s}, U \ {s}) = c(U \ {t}, {t}) (cf. Lemma[^. 

2: Run the preflow algorithm for d on Af starting with the preflow /( formed by increasing // on arcs (s, v) 
to saturate them and decreasing /; on arcs (v, t) to meet the capacity constraints for v € U. As an initial 
valid labeling, use d{v)='min{dfi {v, t), dfi (v, s)+(|17|—2)}. Find the minimal and maximal minimum cuts 

(C, C) and (C',C^) for d, respectively. 

3: If C^= {t}, set y^^<—F(Au)—F(Ai). Otherwise, run S'/ice(A/'(C'), d, a„,/,/„, C, A„). AndifC^js}, 
then run Slice{N'{C),ai,a, fi, f,Ai,C ). 


If parametric capacities in the monotone source-sink class of problems are linear in $\alpha$, all breakpoints, i.e.,
the values of parameter $\alpha$ at which the capacity for the corresponding cut changes, can also be found at the
cost of a constant factor in the worst-case time bound of a single maxflow computation using the GGT-type
algorithms. However, this is generally not true for non-linear capacities because we must solve nonlinear
equations to identify such a parameter value [5D]. Although this is the case for our situation in general,
we can find such a value in closed form for the important cases $p = 2, +\infty$ owing to the specific form of the
problem.

Lemma 7. For network $\mathcal{N}$ corresponding to a graph-representable penalty, the value of $\alpha$ such that

E«GyCa(s, v) = “ E^gC/\vC(s, v) 

is found in closed form for $p = 2, +\infty$.

The concrete derivations of these closed forms are described in the Appendix. For the other cases, we can at
least apply a line search to find such a value of $\alpha$, owing to the monotonicity of $\phi_i$. Thus, we can adapt the
procedure of the GGT-type algorithms to find the chain of solutions $A_1 \subseteq \cdots \subseteq A_k$, which results in
an optimal solution to problem (7), as shown in Algorithm 1 (a brief review on the preflow-push algorithm
used in Algorithm 1 is given in Appendix A).

Theorem 8. Algorithm 1 is correct, and runs at the cost of a constant factor in the worst-case time bound
of a single maxflow computation. For example, it runs in $O(dm \log(d^2/m))$ with dynamic trees.

That is, although the preflow algorithm is applied several times, the total runtime of Algorithm 1 is equivalent
to that of a single application of the preflow algorithm to the original network.

The interval $(\alpha_0, \alpha_{k+1})$ is chosen such that it covers all possible breakpoints $\alpha_1, \ldots, \alpha_k$. In other words, it
suffices to select a sufficiently small $\alpha_0$ so that for each vertex $v$ such that $(s, v)$ is of nonconstant capacity,
$c_{\alpha_0}(s, v) + \sum_{u \in U \setminus \{s, t\}} c(u, v) < c(v, t)$, which is given as

ao ^ ^|;■{mmyev{c{v, f) - ^)}) “ 1- (10) 

Similarly, it suffices to select a^+i sufficiently large so that for each vertex v such that (s, v) is of nonconstant 
capacity, c(v, 0 +X)iig( 7 \{s t} ^ '^)> which is obtained as ak+i t— 0. 

By following the above results, any GGT-type algorithm can be adapted to solve problem (7).
Algorithm 1 shows an adaptation of the simplified version [5] of the original GGT algorithm.

6 Related Work


Learning with structured sparsity-inducing regularization has been actively discussed in machine learning for 
a decade. Typical instances include (generalized) fused Lasso |Sni[5T] and group Lasso [SHlIllllS]- Generalized 
fused Lasso is closely related to the so-called total variation, which has often been discussed in computer 
vision [45] . Recently, group penalties have been applied to more complex groups, such as hierarchical 
penalty [571 HZ] and path penalty m- The total variation regularization is known to be solvable with an 
efficient parametric maxflow algorithm uniiis]. In addition, the proximal problem for the $\ell_1/\ell_\infty$-group penalty is
calculated via parametric maxflow optimization [36]. The proposed optimization formulation includes these
formulations as special cases. Bach (2010) [3] and Bach & Obozinski (2012) [41] revealed that many of the
existing structured penalties are obtained as convex relaxations of submodular functions, and those proximal 
problems are formulated as separable convex minimization. 

The sufficient condition in Section 4 is closely related to the class of energy minimization problems solvable
by graph-cut algorithms 0130]. Energy minimization is a formulation of the maximum a posteriori (MAP)
estimation on MRFs (see, for example, [53]). Similar results are found in the context of realization of a 
submodular function as a cut function in combinatorial optimization laiii!. 

Algorithm 1 is a divide-and-conquer implementation of the preflow algorithm proposed by Goldberg &
Tarjan (1988) [18]. Bach (2010) and Bach & Obozinski (2012) have mentioned an application of the
divide-and-conquer approach to separable convex minimization proposed by Groenevelt (1991) [21] to
proximal problem (6), which costs $O(d)$ submodular minimizations. Algorithm 1 costs only a constant
factor of a single run of the preflow algorithm by adapting the algorithm of Gallo et al. (1989) to the
current problem.


7 Runtime Comparisons 

Here, we show empirical runtime comparisons of our algorithm with some existing ones based on different
principles to assess the scalability of the algorithms. The experiments were run on a 2.6 GHz 64-bit
workstation using C++. We applied our algorithm (we refer to it as 'PARA') to the proximal problems for the
penalty from $F(A) = \sum_{g \in \mathcal{G}} \min\{|A \cap g|, 1\}$ ($p = 2, \infty$) (as a typical example of penalties described in 3.1.1)
and the (generalized) fused penalty (as one described in 3.1.2). We compared ours with the following
algorithms: the decomposition algorithm described in (mis] with the minimum-norm-point (MNP) algorithm /
the maxflow algorithm ('DA-MNP'/'DA-MF') and the algorithm by [36] ('MJOB') for the penalties from
$F(A) = \sum_{g \in \mathcal{G}} \min\{|A \cap g|, 1\}$ (MJOB is applicable only for $p = \infty$), and the MNP algorithm ('MNP'), the
algorithm by Tibshirani & Taylor (2011) ('TT') and the one by Liu et al. (2010) [32] ('LYY') for the
(generalized) fused penalty (LYY is applicable only to the 1D fused case).

We generated data as follows. First, we generated a random vector $z$ from the uniform distribution in
$[-1, 1]^d$. For the generalized fused penalty, we randomly generated a directed network over nodes corresponding
to $V$ using GENRMF from the DIMACS Challenge. For generating overlapping groups, we randomly
generated $d/20$ to $d/10$ groups of size 30 to 100. The graphs in Figure 2 show the empirical runtimes (on a
logarithmic scale) for the algorithms. The plotted points are the averaged values over 10 randomly generated datasets.


8 Conclusions 


In this paper, we provided a comprehensive class of structured penalties for which the proximal problem can
be solved via an efficiently-solvable class of parametric maxflow optimization. Then, we showed that the
parametric maxflow algorithm by Gallo et al. (1989) and its variants, which run at the cost of a constant
factor in the worst-case time bound of the corresponding maxflow optimization, are applicable to solve this
problem. The runtime of the proposed algorithm was empirically compared to those of the state-of-the-art
ones.

^We used the code modified from the one available at http://spams-devel.gforge.inria.fr/

^We used the code available at http://www.yelab.net/software/SLEP/

^The first DIMACS Int'l Algorithm Implementation Challenge (http://dimacs.rutgers.edu/Challenges/).

Figure 2: Runtime comparisons for calculating $\mathrm{prox}_{\lambda \Omega_{F,p}}(z)$. Panels: (a) overlapping group penalty from
$F(A) = \sum_{g} \min\{|A \cap g|, 1\}$ ($p = 2$); (b) overlapping $\ell_1/\ell_\infty$-group penalty.

Several avenues would be worth investigating. First, our formulation does not cover the type of sparsity induced by latent group penalties, such as [25]. As has been noted, the proximal problem for the penalties of Jacob et al. (2009) and their generalization can be solved as a minimum-cost flow problem, which is known to be computable as a parametric maxflow problem if the costs are quadratic and are placed only on edges connected to the source or sink [23]. It would be important to consider a unified framework connecting the current problem and such problems in future work. Also, it would be interesting to design a new structured penalty satisfying the developed condition for some specific application.


A Review of the Preflow-Push Algorithm 


The preflow algorithm computes a maximum flow in a directed network $\mathcal{N}$ [18]. We first define the terminology needed to describe the algorithm. A preflow $f$ on $\mathcal{N}$ is a real-valued function on vertex pairs satisfying the capacity constraint, the antisymmetry constraint, and the following relaxation of the conservation constraint:

\[ \sum_{v_1 \in V} f(v_1, v_2) \ge 0 \quad \text{for all } v_2 \in V \setminus \{s, t\}. \tag{11} \]

For a given preflow, we define the excess $e(v)$ of a vertex $v$ to be $\sum_{u \in V} f(u, v)$ if $v \neq s$, or infinity if $v = s$.

We call a vertex $v \notin \{s, t\}$ active if $e(v) > 0$. A preflow is a flow if and only if Eq. (11) holds with equality for all $v \notin \{s, t\}$, i.e., $e(v) = 0$ for all $v \notin \{s, t\}$. A vertex pair $(v, u)$ is a residual arc for $f$ if $f(v, u) < c(v, u)$. A path of residual arcs is a residual path. A valid labeling $d$ for a preflow $f$ is a function from the vertices to the nonnegative integers and infinity, such that $d(t) = 0$, $d(s) = n$, and $d(v) \le d(u) + 1$ for every residual arc $(v, u)$. The residual distance $d_f(v, u)$ from $v$ to $u$ is the minimum number of arcs on a residual path from $v$ to $u$, or infinity if there is no such path.

To implement the preflow algorithm, we use the incidence list $I(v)$ for each vertex $v$. The elements of $I(v)$ are the unordered pairs $\{v, u\}$ such that $(v, u) \in E$ or $(u, v) \in E$. The algorithm consists of repeating the following procedure until no active vertices exist. Select any active vertex $v_1$. Let $(v_1, v_2)$ be the current edge of $v_1$. Then, apply the appropriate one of the following three cases.

Push: If $d(v_1) > d(v_2)$ and $f(v_1, v_2) < c(v_1, v_2)$, send $\delta = \min\{e(v_1), c(v_1, v_2) - f(v_1, v_2)\}$ units of flow from $v_1$ to $v_2$, by increasing $f(v_1, v_2)$ and $e(v_2)$ by $\delta$, and by decreasing $f(v_2, v_1)$ and $e(v_1)$ by $\delta$.

Get Next Edge: If $d(v_1) \le d(v_2)$ or $f(v_1, v_2) = c(v_1, v_2)$, and $(v_1, v_2)$ is not the last edge in $I(v_1)$, replace $(v_1, v_2)$ as the current edge of $v_1$ with the next edge in $I(v_1)$.

Relabel: If $d(v_1) \le d(v_2)$ or $f(v_1, v_2) = c(v_1, v_2)$, and $(v_1, v_2)$ is the last edge in $I(v_1)$, replace $d(v_1)$ by $\min\{d(v_2) \mid (v_1, v_2) \in I(v_1),\ f(v_1, v_2) < c(v_1, v_2)\} + 1$ and make the first edge in $I(v_1)$ the current edge of $v_1$.

When the algorithm terminates, $f$ is a maximum flow. A minimum cut can be computed, after replacing $d(v)$ by $\min\{d_f(v, s) + n, d_f(v, t)\}$ for each $v \in V$, as $(A, \bar{A})$ such that $A = \{v \mid d(v) \ge n\}$, where the sink side $\bar{A}$ is of minimum size. The worst-case total time is $O(dm \log(d^2/m))$ if we use dynamic trees for the selection of active vertices.
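The push, get-next-edge, and relabel steps reviewed above can be illustrated with the following minimal Python sketch. It is not the implementation used in this paper: the function name `preflow_push_maxflow` and the input format are hypothetical, active vertices are selected in first-in-first-out order, and dense matrices are used in place of incidence lists and dynamic trees, so the stated worst-case bound does not apply to it. The push test is written as $d(v_1) = d(v_2) + 1$, which for a valid labeling on a residual arc is the same as $d(v_1) > d(v_2)$.

```python
from collections import deque


def preflow_push_maxflow(n, arcs, s, t):
    """Return the maximum s-t flow value on vertices 0..n-1.

    arcs maps a pair (u, v) to a nonnegative capacity c(u, v).
    """
    cap = [[0] * n for _ in range(n)]
    for (u, v), c in arcs.items():
        cap[u][v] += c
    flow = [[0] * n for _ in range(n)]  # antisymmetric: flow[u][v] == -flow[v][u]
    excess = [0] * n
    label = [0] * n
    label[s] = n                        # valid labeling: d(s) = n, d(t) = 0
    current = [0] * n                   # position of the current edge of each vertex

    def push(v, w, delta):
        flow[v][w] += delta
        flow[w][v] -= delta
        excess[v] -= delta
        excess[w] += delta

    active = deque()
    for w in range(n):                  # the initial preflow saturates arcs out of s
        if cap[s][w] > 0:
            push(s, w, cap[s][w])
            if w != t:
                active.append(w)

    while active:
        v = active.popleft()
        while excess[v] > 0:
            if current[v] == n:
                # Relabel: one more than the smallest label over residual arcs.
                label[v] = 1 + min(label[w] for w in range(n)
                                   if cap[v][w] - flow[v][w] > 0)
                current[v] = 0
                continue
            w = current[v]
            residual = cap[v][w] - flow[v][w]
            if residual > 0 and label[v] == label[w] + 1:
                if excess[w] == 0 and w not in (s, t):
                    active.append(w)    # w becomes active through this push
                push(v, w, min(excess[v], residual))
            else:
                current[v] += 1         # get next edge
    return excess[t]


arcs = {(0, 1): 3, (0, 2): 2, (1, 2): 1, (1, 3): 2, (2, 3): 3}
max_flow_value = preflow_push_maxflow(4, arcs, s=0, t=3)
```

The excess of the source is not treated as infinite here; the source is simply never placed in the queue of active vertices, which has the same effect.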


B Details of Algorithm 


In this appendix, we describe the details of the algorithm for solving the proximal problem. In particular, we give closed-form solutions for finding the $\alpha$ described in the lemma for $p = 2, \infty$ (i.e., $r = 2, 1$, respectively), which is the key to making the complexity of the algorithm equivalent to that of the original GGT-type algorithm.

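First, from its definition, the function $\psi_i(\tau_i)$ is represented as

\[ \psi_i(\tau_i) = \begin{cases} \frac{1}{2} \lambda^2 \tau_i^{2/r} - \lambda |z_i| \tau_i^{1/r} & (0 \le \tau_i < (|z_i|/\lambda)^r) \\ -\frac{1}{2} z_i^2 & ((|z_i|/\lambda)^r \le \tau_i). \end{cases} \]

Note that this function is non-increasing in $\tau_i$ (for $\tau_i$ such that $0 \le \tau_i \le (|z_i|/\lambda)^r$, it is monotone). The derivative is given by

\[ \psi_i'(\tau_i) = \begin{cases} \frac{\lambda}{r} \tau_i^{1/r - 1} \left( \lambda \tau_i^{1/r} - |z_i| \right) & (0 < \tau_i < (|z_i|/\lambda)^r) \\ 0 & ((|z_i|/\lambda)^r \le \tau_i). \end{cases} \tag{12} \]

This derivative is a non-decreasing function of $\tau_i$ (for $\tau_i$ such that $0 < \tau_i < (|z_i|/\lambda)^r$, it is monotone). Hence, $\psi_i'$ has an inverse function for $0 < \tau_i < (|z_i|/\lambda)^r$.

To give a closed-form solution for $\alpha$ as described in the lemma, it is sufficient to describe how we can find $\alpha$ that satisfies, for $S \subseteq V$,

\[ \sum_{i \in S} (\psi_i')^{-1}(\alpha) = c, \]

where $c$ is some constant. This is stated in the following parts for $p = 2$ and $p = \infty$, respectively.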


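Case for p = 2 (r = 2) By substituting $r = 2$ into Eq. (12), we have for $0 < \tau_i < (|z_i|/\lambda)^2$

\[ \psi_i'(\tau_i) = \frac{\lambda}{2} \left( \lambda - \frac{|z_i|}{\sqrt{\tau_i}} \right). \]

Therefore, $\tau_i$ and $\tau_j$ such that $\psi_i'(\tau_i) = \psi_j'(\tau_j)$ satisfy

\[ |z_i| \sqrt{\tau_j} = |z_j| \sqrt{\tau_i}. \]

This means that, if $\alpha$ satisfies $\sum_{i \in S} (\psi_i')^{-1}(\alpha) = c$, then we have

\[ \tau_i = \frac{c\, z_i^2}{\sum_{j \in S} z_j^2}. \]

Thus, we can calculate such $\alpha$ as

\[ \alpha = \frac{\lambda}{2} \left( \lambda - \sqrt{\frac{\sum_{j \in S} z_j^2}{c}} \right). \]
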

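Case for p = +∞ (r = 1) By substituting $r = 1$ into Eq. (12), we have for $0 \le \tau_i < |z_i|/\lambda$

\[ \psi_i'(\tau_i) = \lambda (\lambda \tau_i - |z_i|). \]

Thus, $\tau_i$ and $\tau_j$ such that $\psi_i'(\tau_i) = \psi_j'(\tau_j)$ satisfy

\[ |z_i| - |z_j| = \lambda (\tau_i - \tau_j). \]

This means that, if $\alpha$ satisfies $\sum_{i \in S} (\psi_i')^{-1}(\alpha) = c$, then we have

\[ \tau_i = \frac{c}{|S|} + \frac{|z_i| - \sum_{j \in S} |z_j| / |S|}{\lambda}. \]

Hence, we can calculate such $\alpha$ as

\[ \alpha = \frac{\lambda}{|S|} \Bigl( \lambda c - \sum_{j \in S} |z_j| \Bigr). \]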
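The two closed forms can be checked numerically with the following Python sketch. It rests on an assumption that should be stated clearly: on the range $0 < \tau_i < (|z_i|/\lambda)^r$ the derivative is taken to be $\psi_i'(\tau_i) = (\lambda/r)\,\tau_i^{1/r-1}(\lambda \tau_i^{1/r} - |z_i|)$, and $\alpha$ is taken to be the common value of $\psi_i'(\tau_i)$ over $i \in S$ subject to $\sum_{i \in S} \tau_i = c$. The function names `psi_prime` and `alpha_closed_form` are hypothetical, and the formulas are meaningful only when every resulting $\tau_i$ stays inside that range.

```python
import math


def psi_prime(tau, z_abs, lam, r):
    """Assumed derivative of psi_i; zero once tau reaches (|z_i| / lam) ** r."""
    if tau >= (z_abs / lam) ** r:
        return 0.0
    return (lam / r) * tau ** (1.0 / r - 1.0) * (lam * tau ** (1.0 / r) - z_abs)


def alpha_closed_form(z_abs, lam, c, r):
    """Return (alpha, taus) with sum(taus) == c and psi_i'(tau_i) == alpha on S."""
    size = len(z_abs)
    if r == 2:    # p = 2
        sum_sq = sum(a * a for a in z_abs)
        taus = [c * a * a / sum_sq for a in z_abs]
        alpha = 0.5 * lam * (lam - math.sqrt(sum_sq / c))
    elif r == 1:  # p = infinity
        mean_abs = sum(z_abs) / size
        taus = [c / size + (a - mean_abs) / lam for a in z_abs]
        alpha = lam * (lam * c - sum(z_abs)) / size
    else:
        raise ValueError("only r = 1 and r = 2 are covered by this sketch")
    return alpha, taus


z_abs = [2.0, 2.5, 3.0]  # illustrative values of |z_i| for i in S
for r in (1, 2):
    alpha, taus = alpha_closed_form(z_abs, lam=1.0, c=3.0, r=r)
    gaps = [abs(psi_prime(t, a, 1.0, r) - alpha) for t, a in zip(taus, z_abs)]
```

In both cases the cost is linear in $|S|$, which is what allows the breakpoint computation to avoid a numerical root search.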





C Proofs 


Theorem 

(i) Let $\mathcal{N}$ be the constructed network (see Figure 1(a)) with the additional node $u$. Then, for each $A \subseteq V$, we have $\kappa_{\mathcal{N}}(\{s\} \cup A) = w(A)$ and $\kappa_{\mathcal{N}}(\{s\} \cup A \cup \{u\}) = y$. Hence, the constructed network indeed represents $F(A) = \min\{w(A), y\}$.
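The claim in part (i) can be checked by brute force on a small instance, as in the following Python sketch. The network layout is an assumption inferred from the two cut values stated in the proof, not a copy of the construction in the figure: each node $v \in V$ has an arc to the additional node $u$ with capacity $w_v$, and $u$ has an arc to $t$ with capacity $y$. The weights, the value of $y$, and the helper name `cut_capacity` are hypothetical.

```python
from itertools import combinations


def cut_capacity(source_side, arcs):
    """Total capacity of the arcs leaving the vertex set source_side."""
    return sum(c for (a, b), c in arcs.items()
               if a in source_side and b not in source_side)


weights = {"v1": 2.0, "v2": 1.0, "v3": 4.0}  # hypothetical weights w_v
y = 3.0                                      # hypothetical truncation level
arcs = {(v, "u"): w for v, w in weights.items()}
arcs[("u", "t")] = y

results = {}
for k in range(len(weights) + 1):
    for A in combinations(weights, k):
        without_u = cut_capacity({"s", *A}, arcs)    # equals w(A)
        with_u = cut_capacity({"s", "u", *A}, arcs)  # equals y
        results[A] = min(without_u, with_u)
```

Under this layout, minimizing over whether $u$ lies on the source side reproduces $\min\{w(A), y\}$ for every subset $A$.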

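(ii) Let $\mathcal{N} = (U = V \cup W \cup \{s, t\}, E)$ be the constructed network (see Figures 1(c),(d)). We show that $F(A) = \min_{Y \subseteq W} \kappa_{\mathcal{N}}(\{s\} \cup A \cup Y) - \kappa_{\mathcal{N}}(\{s\}) + F(\emptyset)$ for every $A \subseteq V$. It is easy to confirm that, for each $A \subseteq V$, the set $\{w_B \in W \mid B \subseteq A\}$ attains the minimum of $\min_{Y \subseteq W} \kappa_{\mathcal{N}}(\{s\} \cup A \cup Y)$. When $A = \emptyset$,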
the minimum value is indeed K_,y({s}). Besides, when % ^ A Q V, \t increases by E«gv max{0, 

and decreases by EugvEaI \ACV with |H| >2}, which implies that 

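\[ \min_{Y \subseteq W} \kappa_{\mathcal{N}}(\{s\} \cup A \cup Y) = \kappa_{\mathcal{N}}(\{s\}) + \sum \{ F^{(|A'|)}(A') \mid \emptyset \neq A' \subseteq A \} = \kappa_{\mathcal{N}}(\{s\}) + F(A) - F(\emptyset). \]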


(iii) For a fixed set function $F$ satisfying Condition (iii), we construct a directed network $\mathcal{N} = (U = V \cup W \cup \{s, t\}, E)$ with nonnegative capacity $c\colon E \to \mathbb{R}_+$ as follows. Then, $\mathcal{N}$ coincides with the network just before Step 4 in the construction procedure in Section 4.2 (up to modular terms), and we have $F(A) = \min_{Y \subseteq W} \kappa_{\mathcal{N}}(\{s\} \cup A \cup Y)$ for every $A \subseteq V$.

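First, we define $W$ as the union of the following:

\[ W_2 := \{ w_A \mid A \in \tbinom{V}{2} \}, \]
\[ W_3^+ := \{ w_A \mid A \in \tbinom{V}{3} \text{ with } F^{(3)}(A) > 0 \}, \]
\[ W_3^- := \{ w_A \mid A \in \tbinom{V}{3} \text{ with } F^{(3)}(A) < 0 \}, \]

where each $w_A$ is an additional node adjacent to the nodes in $A$. Next, we define $E$ as the union of the following:

\[ E^+ := V \times \{t\}, \qquad E^- := \{s\} \times V, \]
\[ E_2 := \{s\} \times W_2, \qquad E_{21} := \{ (w_A, v) \mid w_A \in W_2,\ v \in A \}, \]
\[ E_3^+ := W_3^+ \times \{t\}, \qquad E_{13} := \{ (v, w_A) \mid w_A \in W_3^+,\ v \in A \}, \]
\[ E_3^- := \{s\} \times W_3^-, \qquad E_{31} := \{ (w_A, v) \mid w_A \in W_3^-,\ v \in A \}. \]

Let us define a set function $H\colon 2^V \to \mathbb{R}$ as

\[ H(A) := \sum \{ F^{(3)}(B) \mid A \subseteq B \subseteq V,\ w_B \in W_3^+ \} \quad (A \subseteq V), \]

and the capacity function $c\colon E \to \mathbb{R}_+$ as, for each $e \in E$,

\[ c(e) := \begin{cases} \max\{0, F^{(1)}(\{v\}) - H(\{v\})\} & (e = (v, t) \in E^+) \\ \max\{0, -F^{(1)}(\{v\}) + H(\{v\})\} & (e = (s, v) \in E^-) \\ -F^{(2)}(A) - H(A) & (e = (s, w_A) \in E_2) \\ F^{(3)}(A) & (e = (w_A, t) \in E_3^+) \\ -F^{(3)}(A) & (e = (s, w_A) \in E_3^-) \\ +\infty & (e \in E_{21} \cup E_{13} \cup E_{31}). \end{cases} \]
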


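The nonnegativity of $c$ is guaranteed by the submodularity of $F$ as follows: for any $A = \{u, v\} \subseteq V$ with $|A| = 2$, we have

\begin{align*}
0 &\le \min_{B} \{ F(B \setminus \{u\}) + F(B \setminus \{v\}) - F(B) - F(B \setminus \{u, v\}) \mid A \subseteq B \subseteq V \} \\
&= \min_{B} \Bigl\{ -\sum_{B'} \{ F^{(|B'|)}(B') \mid A \subseteq B' \subseteq B \} \Bigm| A \subseteq B \subseteq V \Bigr\} \\
&= \min_{B} \Bigl\{ -F^{(2)}(A) - \sum_{B'} \{ F^{(3)}(B') \mid A \subseteq B' \in \tbinom{V}{3},\ B' \subseteq B \} \Bigm| A \subseteq B \subseteq V \Bigr\} \\
&= -F^{(2)}(A) - \max_{B \subseteq V \setminus A} \sum_{w \in B} F^{(3)}(A \cup \{w\}) \\
&= -F^{(2)}(A) - H(A).
\end{align*}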


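We first check the value of $\min_{Y \subseteq W} \kappa_{\mathcal{N}}(Y \cup \{s\})$. If $Y \cap (W_2 \cup W_3^-) \neq \emptyset$, then at least one edge in $E_{21} \cup E_{31}$ contributes to the cut capacity, which makes it $+\infty$. Otherwise (i.e., if $Y \subseteq W_3^+$), as no edge in $E$ goes from $s$ to $W_3^+$, we have $\kappa_{\mathcal{N}}(Y \cup \{s\}) \ge \kappa_{\mathcal{N}}(\{s\})$ for any $Y \subseteq W_3^+$, which means that $Y = \emptyset$ attains the minimum value. Without loss of generality, we assume $F^{(0)}(\emptyset) = F(\emptyset) = \kappa_{\mathcal{N}}(\{s\})$ (i.e., $C_F = 0$). Then, it suffices to show that $F(A) = \min_{Y \subseteq W} \kappa_{\mathcal{N}}(A \cup Y \cup \{s\})$ for each nonempty $A \subseteq V$.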

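For any $B \subseteq V$ with $w_B \in W_2 \cup W_3^-$, only the edge $(s, w_B)$ enters $w_B$, and the edges $(w_B, v)$ ($v \in B$) with $c(w_B, v) = +\infty$ leave $w_B$. Therefore, if $B \subseteq A$, we have

\[ \kappa_{\mathcal{N}}(A \cup Y \cup \{s, w_B\}) = \kappa_{\mathcal{N}}(A \cup Y \cup \{s\}) - c(s, w_B) \le \kappa_{\mathcal{N}}(A \cup Y \cup \{s\}) \]

for every $Y \subseteq W \setminus \{w_B\}$. Moreover, for any $B \subseteq V$ with $w_B \in W_3^+$, only the edge $(w_B, t)$ leaves $w_B$, and the edges $(v, w_B)$ ($v \in B$) with $c(v, w_B) = +\infty$ enter $w_B$. Thus, if $B \cap A \neq \emptyset$, we have $w_B \in Y$ and the edge $(w_B, t)$ contributes to the cut capacity. Hence, the minimum value is attained by $Y := \{ w_B \in W_2 \cup W_3^- \mid B \subseteq A \} \cup \{ w_B \in W_3^+ \mid B \cap A \neq \emptyset \}$, and we have

\begin{align*}
& \kappa_{\mathcal{N}}(A \cup Y \cup \{s\}) - \kappa_{\mathcal{N}}(\{s\}) \\
&= \sum_{v \in A} \bigl( c(v, t) - c(s, v) \bigr) - \sum_{w_B \in W_2 \cup W_3^- :\, B \subseteq A} c(s, w_B) + \sum_{w_B \in W_3^+ :\, B \cap A \neq \emptyset} c(w_B, t) \\
&= \sum_{v \in A} \bigl( F^{(1)}(\{v\}) - H(\{v\}) \bigr) + \sum_{w_B \in W_2 :\, B \subseteq A} \bigl( F^{(2)}(B) + H(B) \bigr) + \sum_{w_B \in W_3^- :\, B \subseteq A} F^{(3)}(B) + \sum_{w_B \in W_3^+ :\, B \cap A \neq \emptyset} F^{(3)}(B) \\
&= \sum_{B :\, \emptyset \neq B \subseteq A} F^{(|B|)}(B) + \sum_{w_B \in W_3^+ :\, B \cap A \neq \emptyset \neq B \setminus A} F^{(3)}(B) - \sum_{v \in A} H(\{v\}) + \sum_{w_B \in W_2 :\, B \subseteq A} H(B) \\
&= F(A) - F^{(0)}(\emptyset),
\end{align*}

which means $\kappa_{\mathcal{N}}(A \cup Y \cup \{s\}) = F(A)$. To see the last equality, it suffices to count the contribution of $F^{(3)}(B')$ to the second-to-last line, which is easily seen to be zero in total, for each $B' \subseteq V$ with $w_{B'} \in W_3^+$.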


Lemma 

The first is shown in Lemma 2 and 3 in [40j or Proposition 2.5 in [Q. The equivalence of optimal solutions 
to the two problems is obvious. 


Corollary 

First, from Proposition 8.8 in [5], we obtain a solution to problem (§ as 


h' 

1 sign(zi)(|zi|/A)^ 


if {\zi\/\Y > r*, 
otherwise. 


(13) 


Although the proposition assumes strict convexity of the separable functions, the above can be obtained since $(-\psi_i)'$ is monotone for $\tau_i$ such that $(|z_i|/\lambda)^r > \tau_i$. Then, the corollary follows by analytically solving the minimization with respect to $w$ in the definition of $\psi_i$.


Lemma 

The statement of this lemma is shown in Appendix B.


Theorem 

The correctness follows from the monotone source-sink property of the current network. The runtime follows from the
analysis in d] from Lemma [7) 